# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='...', project_access_token='...')
pc = project.project_context
In the first notebook, Part 1 - Data Exploration, we explored the Fashion-MNIST dataset from the Data Asset Exchange. In this notebook we will train three machine learning classifiers that could be used to identify fashion and clothing items, and compare their performance. Throughout this notebook we will use the scikit-learn machine learning library.
Before you run this notebook complete the following steps:
When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context
If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:
Click More -> Insert project token in the top-right menu section. This should insert a cell at the top of this notebook similar to the example given above.
If an error is displayed indicating that no project token is defined, follow these instructions.
Run the newly inserted cell before proceeding with the notebook execution below.
# Define required imports
import pandas as pd
import numpy as np
from sklearn.metrics import precision_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from warnings import filterwarnings
filterwarnings('ignore')
We start by reading in the training dataset from fashion-mnist_train.csv.
# Training dataset file name
DATA_PATH = 'fashion-mnist_train.csv'
# Create a method that returns a file handle for a data asset in the project
def get_file_handle(fname):
    # Project data path for the raw data file
    data_path = project.get_file(fname)
    data_path.seek(0)
    return data_path

# Use pandas to read the data
data_path = get_file_handle(DATA_PATH)
data = pd.read_csv(data_path).values
# Preview data (label, followed by pixel data)
data
Save the pixel data and labels into two arrays.
# Save the pixel data as "pixel"
pixel = data[:, 1:]
# Save the label data as "label"
label = data[:, 0]
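As a quick sanity check, each row of pixel holds the 784 gray-scale values of one 28x28 image, and label holds the corresponding class id. The sketch below illustrates this layout on two simulated rows (not the real CSV), so the shapes can be verified without loading the dataset:

```python
import numpy as np

# Simulate two rows of the Fashion-MNIST CSV layout: a label column followed by 784 pixel columns
data = np.hstack([np.array([[2], [9]]), np.zeros((2, 28 * 28), dtype=int)])

pixel = data[:, 1:]   # pixel values, one 784-element row per image
label = data[:, 0]    # class labels (0-9)

print(pixel.shape)                      # (2, 784)
print(pixel[0].reshape(28, 28).shape)   # (28, 28) -- a displayable image
```

Reshaping a row back to 28x28 is how one would display a single image, e.g. with matplotlib's imshow.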
We are going to train three machine learning algorithms on this data that could be used to identify fashion and clothing items.
Define a helper function named calculate_metrics, which calculates the following metrics: accuracy, precision, recall and F-score. The helper functions display_metrics and display_scores are used throughout the notebook to display these metrics.
def calculate_metrics(label, label_predict):
    """
    Calculate accuracy, precision, recall and f-score
    """
    acc_score = accuracy_score(label, label_predict)
    pre_score = precision_score(label, label_predict, average='weighted')
    rec_score = recall_score(label, label_predict, average='weighted')
    f_score = f1_score(label, label_predict, average='weighted')
    return (acc_score, pre_score, rec_score, f_score)

def display_metrics(label, label_predict):
    """
    Calculate and display accuracy, precision, recall and f-score
    """
    scores = calculate_metrics(label, label_predict)
    print("Model Accuracy : {}".format(scores[0]))
    print("Model Precision: {}".format(scores[1]))
    print("Model Recall   : {}".format(scores[2]))
    print("Model F-Score  : {}".format(scores[3]))

def display_scores(scores):
    """
    Display scores (e.g. accuracy, precision, etc.) and calculate mean
    and standard deviation
    """
    print("Scores            : {}".format(scores))
    print("Mean              : {}".format(scores.mean()))
    print("Standard deviation: {}".format(scores.std()))
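To illustrate what these helpers report, here is a toy example with ten hand-made labels (a sketch, not part of the notebook's training flow), computed with the same scikit-learn metric functions and the same average='weighted' setting:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

label         = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
label_predict = [0, 1, 2, 1, 1, 0, 2, 2, 0, 2]  # two of ten predictions are wrong

print("Accuracy :", accuracy_score(label, label_predict))                        # 0.8
print("Precision:", precision_score(label, label_predict, average='weighted'))
print("Recall   :", recall_score(label, label_predict, average='weighted'))
print("F-Score  :", f1_score(label, label_predict, average='weighted'))
```

With average='weighted', the per-class scores are averaged weighted by each class's support, which is why weighted recall coincides with accuracy in the multi-class case.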
A decision tree is a supervised machine learning technique that can be used to classify data. A decision tree consists of three components: internal nodes, edges/branches and leaf nodes. Each internal node tests a feature; for example, a node might test whether a garment has sleeves, and Yes would be an edge to another node that might determine whether the sleeves are long. A leaf node assigns a class, such as Pullover, if the previous test determined that the garment has long sleeves. We will use the scikit-learn implementation of the Decision Tree Classifier and configure it to build a decision tree from the pixel and label training data. As hyperparameters we use a combination that performed well in these benchmarks and yields results quickly. We specify an arbitrary random number generator seed of 42 to allow for reproducible results.
# Build an sklearn.tree.DecisionTreeClassifier from the training dataset
decision_tree = DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=10, random_state=42)
# Train the classifier
decision_tree.fit(pixel, label)
Test the decision tree classifier using the pixel training data. For illustrative purposes we also display the first 20 predictions and expected results to allow for a quick visual comparison.
# Test classifier using the pixel data
label_predict = decision_tree.predict(pixel)
# Review the first 20 labels and predicted labels
print('Correct labels : {}'.format(label[:20]))
print('Predicted labels: {}'.format(label_predict[:20]))
Looking at this small sample, we can already see that not all predictions were correct. Let's calculate and display model accuracy, precision, recall and F-score for the trained classifier.
# display model performance stats
display_metrics(label, label_predict)
The trained model has good accuracy, precision, recall and F-score.
Cross-validation, sometimes called rotation estimation or out-of-sample testing, is one of various model validation techniques for assessing how well the results of a statistical analysis generalize to an independent data set. It is a statistical method used to estimate the performance of machine learning models. The goal of cross-validation is to test the model's ability to predict on new data that was not used in estimating it, in order to flag problems like overfitting or selection bias and to give an insight on how the model will generalize to an independent dataset.
The general procedure for k-fold cross validation is as follows:
Shuffle the dataset and split it into k groups (folds) of approximately equal size
For each fold: hold it out as a test set, fit the model on the remaining k-1 folds, evaluate the fitted model on the held-out fold, and retain the evaluation score
Summarize the skill of the model using the sample of model evaluation scores
Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the hold out set 1 time and used to train the model k-1 times.
This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds.
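This fold-membership property can be checked with scikit-learn's KFold splitter on a toy array (a minimal sketch; the notebook itself uses cross_val_score, which performs the splitting internally):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # six toy samples
held_out = []

# each iteration trains on k-1 folds and holds out the remaining fold
for train_idx, test_idx in KFold(n_splits=3).split(X):
    held_out.extend(test_idx)

# every sample index appears in a hold-out set exactly once
print(sorted(held_out))  # [0, 1, 2, 3, 4, 5]
```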
In this notebook we perform 3-fold cross validation, that is, k=3.
# Feature scaling is not required for a Decision Tree
decision_tree_scores = cross_val_score(decision_tree, pixel, label, cv=3, scoring="accuracy")
display_scores(decision_tree_scores)
label_predcv = cross_val_predict(decision_tree, pixel, label, cv=3)
decision_tree_cv = calculate_metrics(label,label_predcv)
The prediction performance of the model is reasonable: accuracy, precision, recall and F-score are all around 80%. Let's try a different approach and see if we can achieve better prediction performance.
In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to. A Linear Classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector.
In this notebook, we use the scikit-learn implementation of a Linear SGD Classifier. SGD refers to Stochastic Gradient Descent, which is an iterative algorithm to find the target weights of the linear classifier. The feature vector in this case is a vector of pixel values from the image.
A few points to keep in mind when we use this classifier:
We need to apply feature scaling carefully and choose hyperparameters wisely.
Each image in the dataset has 784 features (28x28 pixels vectorized into a 784x1 vector) and the value of each pixel ranges from 0 to 255. We use sklearn's sklearn.preprocessing.StandardScaler
class to perform feature scaling on the dataset so that the values are standardized into a smaller range. The scaling formula is x_scaled = (x - x_mean) / x_standarddeviation
, which is also known as the z-score in statistical analysis: it measures how many standard deviations each point is away from the mean value.
# Create an sklearn.preprocessing.StandardScaler instance
scaler = StandardScaler()
# Map pixels with the Scaler
pixel_scaled = scaler.fit_transform(pixel.astype(np.float64))
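The scaler's output matches the z-score formula directly, which can be verified on a small array (a sketch; note that StandardScaler uses the population standard deviation, i.e. ddof=0):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[0.], [127.], [255.]])  # one feature column of pixel-like values
scaled = StandardScaler().fit_transform(x)

# Manual z-score: (x - mean) / population standard deviation
manual = (x - x.mean()) / x.std()
print(np.allclose(scaled, manual))  # True
```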
Build and train a Linear SGD Classifier from the training dataset using a combination of hyperparameters that performed well in these benchmarks and yields results quickly. We specify an arbitrary random number generator seed of 42 to allow for reproducible results.
# Create an sklearn.linear_model.SGDClassifier
sgd = SGDClassifier(loss='hinge', random_state=42, penalty='l2')
# train the classifier using the labels and the feature-scaled pixel values
sgd.fit(pixel_scaled, label)
Test the Linear SGD Classifier using the scaled pixel training data.
# Test classifier using the pixel data
label_predict = sgd.predict(pixel_scaled)
# display model performance stats
display_metrics(label, label_predict)
sgd_scores = cross_val_score(sgd, pixel_scaled, label, cv=3, scoring="accuracy")
display_scores(sgd_scores)
label_predcv = cross_val_predict(sgd, pixel_scaled, label, cv=3)
linear_classifier_cv = calculate_metrics(label,label_predcv)
It appears that the trained Linear SGD Classifier is performing better than the Decision Tree classifier. Let's try one more classifier.
In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc. Each object being detected in the image would be assigned a probability between 0 and 1. Logistic regression is a supervised classification algorithm.
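At the core of the logistic model is the logistic (sigmoid) function, which maps any real-valued linear score to a probability between 0 and 1. A minimal sketch (the multi_class="ovr" classifier below fits one such binary model per class):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# large positive scores map close to 1, large negative scores close to 0
print(sigmoid(0.0))   # 0.5
print(sigmoid(4.0))   # ~0.982
print(sigmoid(-4.0))  # ~0.018
```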
In this notebook we use the scikit-learn implementation of a Logistic Regression Classifier and apply a hyperparameter combination that performed well in these benchmarks and yields results quickly. We specify an arbitrary random number generator seed of 42 to allow for reproducible results.
# Create an sklearn.linear_model.LogisticRegression classifier
log = LogisticRegression(multi_class="ovr", penalty='l2', solver="lbfgs", C=10, random_state=42)
# train the classifier using the labels and the feature-scaled pixel values
log.fit(pixel_scaled, label)
Test the Logistic Regression Classifier using the feature-scaled pixel training data.
# predict dataset pixel_scaled using trained model
label_predict = log.predict(pixel_scaled)
# display model performance stats
display_metrics(label, label_predict)
log_scores = cross_val_score(log, pixel_scaled, label, cv=3, scoring="accuracy")
display_scores(log_scores)
label_predcv = cross_val_predict(log, pixel_scaled, label, cv=3)
log_regression_cv = calculate_metrics(label,label_predcv)
The prediction power of the Logistic Regression Classifier is slightly better than that of the SGD Classifier, comparing their 3-fold cross validation scores.
Let's compare the three model's cross validation performance side by side!
model_comparison_df = pd.DataFrame([decision_tree_cv, linear_classifier_cv, log_regression_cv],
columns =['Accuracy', 'Precision', 'Recall', 'F-Score'],
index=['decision_tree_cv', 'linear_classifier_cv', 'log_regression_cv'])
model_comparison_df
In our example a comparison of accuracy, precision, recall, and F-score indicates that the trained Logistic Regression classifier would yield the best prediction results.
Next, continue with the Part 3 - DL and Model Evaluations notebook.
This notebook was created by the Center for Open-Source Data & AI Technologies.
Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.